Use aligned_vec crate to eliminate unsound code #24

Merged 4 commits on Feb 20, 2024

shssoichiro
Collaborator

This came out of the discussion in #23, to measure the performance impact of switching to a crate that provides an aligned Vec implementation. The benchmarks in this crate show small improvements in some cases and small regressions in others, depending on the benchmark. The change figures below are relative to the main branch.

frame new_with_padding padding=0
                        time:   [4.1241 µs 4.1381 µs 4.1629 µs]
                        change: [-7.6654% -7.1637% -6.7668%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 19 outliers among 100 measurements (19.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  2 (2.00%) high mild
  11 (11.00%) high severe

frame new_with_padding padding!=0
                        time:   [6.6522 µs 6.6544 µs 6.6569 µs]
                        change: [-9.7115% -9.3830% -9.0024%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
  3 (3.00%) high severe

plane new padding=0     time:   [2.4437 µs 2.4523 µs 2.4656 µs]
                        change: [-1.6874% -1.4127% -1.0478%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

plane new padding!=0    time:   [3.3975 µs 3.4143 µs 3.4357 µs]
                        change: [-8.4273% -7.4063% -6.7211%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

plane clone             time:   [5.8992 µs 5.9049 µs 5.9141 µs]
                        change: [+8.0155% +8.3304% +8.5803%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

plane pad               time:   [5.2718 µs 5.2953 µs 5.3186 µs]
                        change: [-1.3709% +0.5738% +2.5759%] (p = 0.57 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low severe
  5 (5.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

plane copy_from_raw_u8 8-bit
                        time:   [8.3387 µs 8.3634 µs 8.3884 µs]
                        change: [+4.9790% +6.3446% +7.7890%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) low severe
  6 (6.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

plane copy_from_raw_u8 10-bit
                        time:   [9.3816 µs 9.4546 µs 9.5246 µs]
                        change: [+5.9267% +7.0326% +8.2200%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  11 (11.00%) low mild
  2 (2.00%) high mild
  4 (4.00%) high severe

plane downsampled       time:   [11.012 µs 11.063 µs 11.152 µs]
                        change: [+4.6064% +5.0602% +5.6612%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe

plane downscale         time:   [9.6774 µs 9.6804 µs 9.6839 µs]
                        change: [+5.1378% +5.3110% +5.4371%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

plane rows_iter         time:   [14.605 µs 14.677 µs 14.830 µs]
                        change: [-0.2160% +0.0117% +0.4976%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) high mild
  4 (4.00%) high severe

@codecov-commenter
codecov-commenter commented Feb 11, 2024

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (7184cb8) 65.58% compared to head (f789925) 66.52%.

Files Patch % Lines
src/plane.rs 95.23% 1 Missing ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
+ Coverage   65.58%   66.52%   +0.93%     
==========================================
  Files           4        4              
  Lines         985      923      -62     
==========================================
- Hits          646      614      -32     
+ Misses        339      309      -30     


src/plane.rs Outdated
for v in pd.iter_mut() {
*v = T::cast_from(128);
Self {
data: avec_rt!([Self::DATA_ALIGNMENT]| T::cast_from(0); len),
Contributor

This should probably not use avec_rt because it's technically a runtime-variable alignment. You could use AVec::<T, ConstAlign<Self::DATA_ALIGNMENT>> and manually create it. Bit less convenient but carries a compile-time alignment guarantee.

Member

We can use from_iter; keep in mind that we want 128 as the fill value.

Collaborator Author

@shssoichiro shssoichiro Feb 11, 2024

Yeah, my assumption was that because we needed the alignment to be different on wasm, we needed to use a runtime alignment here, but I suppose we could do that in a different way and still use const align.

Member

The alignment is known at compile time anyway, so it is a matter of defining type A = ConstAlign<DATA_ALIGNMENT> and then using it where needed.
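
As a rough illustration of this thread (a standard-library sketch, not the aligned-vec API): when the alignment lives in a type, it occupies no storage and the target-specific value is chosen once through cfg. The trait `Alignment`, the local stand-ins `ConstAlign` and `RuntimeAlign`, the alias `PlaneAlign`, and the values 8 and 64 are all assumptions made for this example.

```rust
/// Local stand-in for an alignment abstraction (hypothetical, not aligned-vec's trait).
pub trait Alignment: Copy {
    /// Alignment in bytes.
    fn bytes(self) -> usize;
}

/// Alignment fixed at compile time; the value lives in the type, so this is zero-sized.
#[derive(Clone, Copy, Debug, Default)]
pub struct ConstAlign<const N: usize>;

impl<const N: usize> Alignment for ConstAlign<N> {
    fn bytes(self) -> usize {
        N
    }
}

/// Alignment carried as a runtime value; it has to be stored alongside the buffer.
#[derive(Clone, Copy, Debug)]
pub struct RuntimeAlign(pub usize);

impl Alignment for RuntimeAlign {
    fn bytes(self) -> usize {
        self.0
    }
}

// Assumed values for illustration: a smaller alignment on wasm32, a larger one elsewhere.
#[cfg(target_arch = "wasm32")]
pub const DATA_ALIGNMENT: usize = 8;
#[cfg(not(target_arch = "wasm32"))]
pub const DATA_ALIGNMENT: usize = 64;

/// One alias, used wherever the alignment type is needed.
pub type PlaneAlign = ConstAlign<DATA_ALIGNMENT>;
```

The constant still differs per target, but it is resolved by the compiler rather than passed around as a value, which is the guarantee the review comments ask for.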

@lu-zero
Member

lu-zero commented Feb 11, 2024

That crate also has an ABox; we could use into_boxed_slice, which might improve the situation further.
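
For readers unfamiliar with the idea, here is the standard-library analogue of that suggestion, as a sketch only: a boxed slice drops the capacity field that a growable vector carries, which suits a buffer whose size never changes after construction. The function names `filled_boxed` and `handle_sizes` are hypothetical, and whether ABox gives a measurable benefit in this crate is not checked here.

```rust
use std::mem::size_of;

/// Build a fixed-size buffer filled with `value`, then give up the ability to grow it.
pub fn filled_boxed(len: usize, value: u8) -> Box<[u8]> {
    vec![value; len].into_boxed_slice()
}

/// Sizes of the two handles: (growable vector, boxed slice).
pub fn handle_sizes() -> (usize, usize) {
    (size_of::<Vec<u8>>(), size_of::<Box<[u8]>>())
}
```

The boxed slice is a pointer plus a length, so the handle is no larger than the vector's, and the type itself documents that the buffer is not resized.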

Member

@lu-zero lu-zero left a comment

I like the idea overall. A few suggestions above, besides using ABox.

@FreezyLemon
Contributor

FreezyLemon commented Feb 11, 2024

I'd like to try a manual implementation without the soundness bugs I've noticed. We could then compare the two for performance differences, because comparing against main is not very useful in that regard (correct vs. incorrect code).

@FreezyLemon
Contributor

I think I have a working implementation here in my fork. I'll have to double-check some stuff, but it's probably good enough for some preliminary benchmarking. My notebook is probably not reliable enough for these sensitive benches (~5% differences could also be caused by CPU boosting etc., which is more pronounced on mobile platforms)

Co-authored-by: Luca Barbato <[email protected]>
@shssoichiro
Collaborator Author

shssoichiro commented Feb 11, 2024

Yes, I'm also uncertain about the reliability of my benchmarks for the same reason. My original thought was that being less than 10% puts them within a range where noise could be the cause, given the very small scale of time that these benchmarks are running on. Though, I would personally prefer the version where we don't have to maintain any unsafe code, assuming performance is similar.
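
The reasoning in this comment can be written down as a small check. This is a sketch under stated assumptions: the `Verdict` enum and `classify` function are hypothetical names, the input is the three-value `change` interval that the benchmark output above prints, and the 10% band is the figure mentioned in the comment rather than a validated threshold.

```rust
/// Outcome of comparing a change interval against a noise band (hypothetical helper).
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Improved,
    Regressed,
    WithinNoise,
}

/// `change_pct` is [lower, estimate, upper] in percent, as printed on a `change:` line.
/// A result only counts when the whole interval lies outside +/- `noise_pct`.
pub fn classify(change_pct: [f64; 3], noise_pct: f64) -> Verdict {
    let [lower, _estimate, upper] = change_pct;
    if lower > noise_pct {
        Verdict::Regressed
    } else if upper < -noise_pct {
        Verdict::Improved
    } else {
        Verdict::WithinNoise
    }
}
```

With a 10% band, the "plane clone" interval of [+8.0155%, +8.3304%, +8.5803%] counts as within noise; with a 5% band the same interval counts as a regression, which shows how much the conclusion depends on the band chosen.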

@FreezyLemon
Contributor

I just meant that a laptop/mobile device isn't the best thing to run benchmarks on.

> Though, I would personally prefer the version where we don't have to maintain any unsafe code, assuming performance is similar.

Generally agree, but unsafe code might buy us something here, e.g. the opportunity to (correctly and soundly) use uninitialized memory for better performance, which is probably not possible (or a lot more complicated) with externally defined structs.

Not 100% sure where I stand on this.
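
To make the trade-off concrete, here is a minimal standard-library sketch of soundly filling a buffer without zeroing it first. It is not code from this crate; `build_samples` and its parameters are hypothetical, and a plain `Vec<u8>` stands in for the aligned storage.

```rust
use std::mem::MaybeUninit;

/// Fill a `width * height` buffer by writing each sample exactly once,
/// skipping the zero-initialization a `vec![0; len]` would perform.
pub fn build_samples(
    width: usize,
    height: usize,
    sample: impl Fn(usize, usize) -> u8,
) -> Vec<u8> {
    let len = width * height;
    let mut data: Vec<u8> = Vec::with_capacity(len);

    // The spare capacity is exposed as MaybeUninit slots, so nothing is read before it is written.
    let spare: &mut [MaybeUninit<u8>] = &mut data.spare_capacity_mut()[..len];
    for (i, slot) in spare.iter_mut().enumerate() {
        slot.write(sample(i % width, i / width));
    }

    // SAFETY: the loop above initialized the first `len` elements, and `len` does not
    // exceed the capacity requested from `with_capacity`.
    unsafe { data.set_len(len) };
    data
}
```

The single `unsafe` line is the maintenance cost under discussion: it is only sound as long as every slot is written before `set_len`, an invariant the compiler does not check.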

@shssoichiro
Collaborator Author

We do use uninitialized vecs in a few other places in rav1e. I'm not a fan of it, but it does give performance improvements in some places, though in other places I've found it to give no benefit, likely because the compiler already optimizes the initialization there.

@FreezyLemon
Contributor

I've tried for some time now, and I can't seem to get stable benchmark results even on my desktop machine. I think it's fine to just go ahead with this (over a manual implementation) without investigating this much further, but I also don't know how much of an impact this could have for consumers (esp. rav1e).

Optimizations can still be done afterwards, and having reliably sound code seems more important.

@lu-zero
Member

lu-zero commented Feb 19, 2024

To improve the situation we'd need a MaybeFrame/MaybePlane, I'm afraid. But that can be done later, I think.
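
One possible reading of that idea, as a purely hypothetical sketch: a separate type owns the uninitialized storage and only converts to an ordinary buffer once every sample has been written. Nothing here is taken from the crate; the `MaybePlane` shape, its methods, and the use of `Vec<u8>` in place of the aligned plane storage are assumptions.

```rust
use std::mem::{ManuallyDrop, MaybeUninit};

/// Hypothetical plane whose storage is allocated but not yet initialized.
pub struct MaybePlane {
    data: Vec<MaybeUninit<u8>>,
    written: usize,
}

impl MaybePlane {
    /// Allocate room for `len` samples without initializing them.
    pub fn new(len: usize) -> Self {
        let mut data: Vec<MaybeUninit<u8>> = Vec::with_capacity(len);
        data.resize_with(len, MaybeUninit::uninit);
        Self { data, written: 0 }
    }

    /// Write the next sample; returns false once the plane is full.
    pub fn push(&mut self, value: u8) -> bool {
        match self.data.get_mut(self.written) {
            Some(slot) => {
                slot.write(value);
                self.written += 1;
                true
            }
            None => false,
        }
    }

    /// Convert into an initialized buffer, or hand the plane back if samples are missing.
    pub fn finish(self) -> Result<Vec<u8>, Self> {
        if self.written != self.data.len() {
            return Err(self);
        }
        let mut data = ManuallyDrop::new(self.data);
        // SAFETY: every element was written through `push`, and MaybeUninit<u8> has the
        // same layout as u8, so the pointer, length, and capacity remain valid.
        Ok(unsafe {
            Vec::from_raw_parts(data.as_mut_ptr().cast::<u8>(), data.len(), data.capacity())
        })
    }
}
```

The design keeps the unsafe conversion in one place and lets callers hold only fully initialized planes, at the cost of a second type to maintain.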

@shssoichiro
Collaborator Author

Based on the discussion, it sounds like we're good to go ahead and merge this. So I'll do so.

@shssoichiro shssoichiro merged commit 0244b29 into main Feb 20, 2024
3 checks passed
@shssoichiro shssoichiro deleted the aligned-vec branch February 20, 2024 19:41
*v = T::cast_from(128);
Self {
data: AVec::from_iter(
Self::DATA_ALIGNMENT,
Contributor

After skimming through aligned-vec, it's fine (and maybe preferable?) to just pass 0 here. That way there'd be no need for another target-dependent const variable.

4 participants